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AUTOREGRESSIVE  SPECTRAL  ESTIMATION  AND 
FUNCTIONAL  INFERENCE 

Emanuel  Parzen 

Institute  of  Statistics,  Texas  A&M 
University,  College  Station,  TX  77843  USA 


The  theory  of  statistical  spectral  analysis  has  a  long  tradition  of 
applications  to  the  sciences  of  oceanography,  underwater  sound,  anti-submarine 
warfare,  and  signal  processing.  The  probability  theory  of  spectral  analysis  was 
pioneered  by  Norbert  Wiener  (1930)  under  the  stimulus  of  the  question  of  how  to 
describe  mathematically  the  randomly  changing  surface  waves  on  the  Charles  River 
on  which  he  daily  gazed  from  his  office  window  at  M.I.T.  The  statistical  theory 
of  spectral  analysis  began  to  be  developed  in  the  1950* s,  and  was  early  applied 
to  the  ocean  sciences  by  Walter  Munk  (1957)  who  was  in  close  touch  with  relevant 
research  by  statisticians  such  as  Blackman  and  Tukey  (1959),  Goodman  (1957),  and 
Parzen  (1957,  1958,  1961).  (Parzen's  research  was  stimulated  by  his  work  at 
Hudson  Laboratories  of  Columbia  University,  an  ONR  sponsored  project  on  under¬ 
water  sound).  Carter  and  Nuttall  (1981)  describe  the  interests  of  the  navies  of 
the  world  in  signal  processing  systems  to  detect,  classify,  and  localize  the 
sources  of  signals. 

A  discussion  of  the  naval  applications  of  time  series  analysis  requires 
attention  to  the  unique  characteristics  of  signals  and  noises  encountered  in  the 
naval  environment,  a  discussion  of  quantization  and  time/frequency ‘ sampling 
techniques.  These  topics  are  not  considered  in  this  paper  which  aims  to  discuss 
abstractly  the  statistial  analysis  of  a  discrete  parameter  time  series  Y(t), 
t=0,  +1,...  .  We  aim  to  make  the  ocean  scientist  aware  of  new  ways  of  thinking 
about  statistical  problems  that  are  provided  by  learning  how  to  formulate 
statistical  inference  problems  as  functional  Inference  problems  that  can  be 
approached  using  techniques  inspired  by  the  pioneering  research  on  spectral 
density  estimation,  especially  research  on  autoregressive  spectral  estimation 
and  maximum  entropy  spectral  estimation.  I  use  "functional  inference"  to  denote 
a  special  case  of  "abstract  inference"  as  defined  by  Grenander  (1981). 

To  describe  what  is  meant  by  functional  inference,  note  that  many  statistical 
problems  are  concerned  with  the  estimation  of  the  parameters  of  probability 
models  (models  for  the  probability  distributions  of  observed  samples).  The 
parameters  are  usually  defined  to  be  a  finite  dimensional  vector  6  «=  (6^, . . . ,8^) . 
Functional  inference  defines  the  parameters  to  be  functions  [such  as  f(u), 

0<u<l]  on  the  unit  interval,  or  the  unit  square,  or  the  unit  hyper-cube  in 
higher  dimensions. 

Functions  used  to  describe  the  probability  distributions  of  time  series  (both 
Gaussian  and  non-Gaussian)  are  introduced  (section  1) .  The  concept  of  type  of  a 
time  series  is  defined  (section  2).  Autoregressive  spectral  densities  are 
defined  (section  3).  Order  determining  criteria  are  motivated  (section  5) 
through  the  concept  of  model  identification  by  estimating  information.  An 
approach  to  empirical  spectral  analysis  is  suggested  (section  4). 

1.  Describing  Gaussian  and  Non-Gaussian  Stationary  Time  Series 

To  say  that  a  function  Y(t),  t*0,  +1,...  is  a  time  series  (or  a  stochastic 
process  or  a  random  function)  is  to  say  that  the  function  Y(t)  does  not  have  at 
each  time  t  a  definite  value  but  has  an  ensemble  of  values  which  it  assumes  in 
This  research  was  supported  by  the  Office  of  Naval  Research  (Contract 
N00014 - 81-MP-10001 ,  ARO  DAAG29-80-C0070)  . 
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accordance  with  a  probability  distribution.  To  describe  the  probability  distri¬ 
bution  of  a  continuous  random  variable  Y(t)  one  uses  one  (or  more)  of  the 


distribution  function 

FY(t)(y) 

-Pr[Y(t)<y], 

probability  density  function' 

fY(t) 

"  dy  FY(t)(y) 

quantile  function 

QY(t)(u) 

■  Fi(t)(u> 

quantile  density  function 

qY(t)(u) 

“  du  QY(t)*u) 

density-quantile  function 

f(^Y(t)  ^ 

=  {qY(t)(u)} 

-1 


fY(t)  {QY(t)(u)} 


y(t)  =  E(Y(t>], 

K(s,t)  «  Cov  [Y£s),  Y(t) ] . 
o(t)  -  (K(t,t)} * 


m 


.Moments  are  described  by 
mean  value  function 
Covariance  kernel 
standard  deviation  function 
A  time  series  is  called  Gaussian  if  any  linear  combination  ][  c^Y(t^)  is 

normally  distributed 
are  denoted 

(1)  *(y)  =  /Y  <Ky)  dy,  <p(y) 


i*=l 

The  standard  Gaussian  distribution  and  density  functions 

JL_  „-(%)y2 


/u 


-i, 


The  standard  Gaussian  quantile  function  is  denoted  ♦  (u).  Then 


(2) 


Y(t) 


QY(t)(“)  =  w(t)  +  o(t)  # 


-1 


(u). 


The  time  series  is  called  covariance  stationary  if  K(s,t)  is  a  function  only 
of  | s— t | .  We  then  define  the  covariance  function. 

(3)  R(v)  «  K(t,  t+v),  v=0,  +1,...  . 

It  is  a  non-negative  definite  function,  in  the  sense  that 
n 

(4) 


t  Vj  R<vrV  i  0 

for  any  integer  n,  any  complex  numbers  c 
to  denote  the  complex  conjugate  of  c. 


c  ,  and  lags  v. ,...,v  .  We  use 
in  in 


When  £  |R(v) |  <  “,  we  define  the  power  spectrum  by 

V-T.OD  m 

(5)  S(u))  *  £  e  1Tiv“  R(v),  0<u<l. 

yx-to 

The  frequency  variable  u  is  usually  assumed  to  vary  in  the  interval  -0.5<«<0.5, 
but  only  the  interval  0<u)<0.5  has  physical  signficance.  We  prefer  to  take  0<w<l 
as  the  domain  of  u,  because  this  choice  is  more  convenient  for  developing  iso¬ 
morphisms  between  spectral  analysis  and  non-parametric  data  modeling  using 
quantile  and  density-quantile  functions  [introduced  in  Parzen  (1979)).  The 
spectral  representation  of  R(v)  is 

(6)  R(v)  -  J1  e2*1™  S(u>)  dw. 

o 

In  my  judgement  the  results  of  a  statistical  analysis  of  a  time  series  are 
most  insightful  when  expressed  in  units  of  R(0),  the  variance.  Thus  the 
covariance  function  R(v)  is  replaced  by  the  correlation  function 

(7)  P<v)-j[^«  Corr  [Y(t),  Y(t-*v)]. 
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The  probability  law  of  a  Gaussian  time  series  is  completely  determined  by  the 
mean  value  function  and  the  covariance  kernel.  The  correlation  structure  of  the 
time  series  is  described  by  the  correlation  function  p(v)  and  various  functions 
(spectral  density,  and  autoregressive-moving  average  schemes)  equivalent  to  p( 
For  non-Gaussian  time  series  the  dimensionless  dependence  between  pairs  of 
variables  Y(t)  and  Y(C+v)  is  described  by  a  sequence  of  dependence 
functions  dv(u^,u2)  defined  as  follows:  for  0^u^,u2<l 

FY(t),  Y(t+v)*QY(t)(ul)’  QY(t+v)*u25) 

.2 


(8) 


dv<Vu2> 


(9) 


VvV 


3u^3u2 


VvV* 


One  can  show  that  in  the  normal  case 
(10)  p(v) 


In  ll  *  dD„(u,,u,) 


t  o  j0  -  '  l7  x  2  vv  1*  2 

Other  measures  of  dependence  that  can  be  considered  Include ( 

(11)  pU(v)  “  /q  Jg  (Uj^-0.5)  (u2-0.5)  dDv(ultu2) 

which  is  the  correlation  function  of  the  transformed  time  series 

(12)  U(t)  -  Fy(t)  (Y(t)), 

called  the  rank- transform  of  the  original  time  series.  To  form  U(t)  one  appears 
to  need  to  know  the  distribution  function  of  Y(t);  in  the  stationary  case,  one 
can  estimate  it,  and  therefore  U(t),  from  the  sample.  The  theory  and  application 
of  U(t)  requires  further  development.  But  in  my  judgement  it  provides  a  basis 
for  a  functional  inference  point  of  view  to  develop  methods  of  time  series 
analysis  that  are  not  based  entirely  on  the  Gaussian  model. 

In  this  paper  we  discuss  methods  of  time  series  analysis  based  on  the  con¬ 
ventional  correlation  function  p(v).  It  is  a  non-negative  definite  function 
satisfying  p(0)  =  1.  Therefore  there  exists  a  function  F(w),  0<w<l,  called  the 
spectral  distribution  function  such  that 

(13)  p  (v)  =  J*  e2*1™  dF  (u) 

If  p(v)  is  summable,  then  the  derivative  f(u)  *  F'(w)  can  be  explicitly 
calculated  by  ~  «  . 

(14)  f(u>)  -  l  e  “pfv),  0<ok1. 

y=-oo 

We  call  f(u)  the  spectral  density  function. 

Note  that  F(0)  *=  0,  F(l)  =  1,  and 

(15)  F(w)  *  /U  f(w')  dm',  0<u><1 

Also  F(0.5)  *  0.5  since  f(m)  is  an  even  function  in  the  sense  that  f (l-«)«f (w) . 
When  analyzing  data,  one  computes  a  modified  spectral  distribution  function  F+(w) 
which  is  defined  for  0<u<0.5  by  F+(w)  *  2  F(u>). 

For  a  spectral  density  f(m)  obeying  suitable  conditions,  one  can  define  the 
inverse-correlation  function  [see  Cleveland  (1972),  Farzen  (1974),  Chatfield 

\  /  1  2rlvu>  --1.  .  .  .  fl  ,-l,  .  . 

(16)  pi(v)  «/Q  e  f  (w)  dm  t  f  (u)  dot 

and  the  cepstral-correlation  function  [see  Wahba  (1980)  for  an  application] 

2nlvu  .  .  , 

log  f(u)  du 

It  should  be  noted  that  the  inverse-correlation  function  is  non-negative 
definite.  However  the  cepstral-correlation  function  is  not.  These  new  types  of 
correlation  functions  are  introduced  because  they  may  provide  more  parsimonious 
parametrizations  in  the  sense  that  they  decay  to  0  faster  than  does  the 
correlation  function.  Statistical  inference  (from  a  sample)  of  the  probability 


(17)  y(v)  -  J*  e 
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law  of  a  time  series  often  achieves  greatest  statistical  efficiency  by  using  the 
most  parsimonious  parametrizations.  Thus  to  form  estimators  t(u>)  of  the  spectral 
density  f(u)  from  a  raw  estimator  f(u>),  greater  precision  may  be  attained  by 
first  forming  estimators  {f"l(u)}“  and  {log  f(u))}*  of  the  inverse  or  logarithm  of 
the  spectral  density.  Autoregressive  spectral  estimation  may  be  regarded  as  an 
approach  to  estimating  f(u)  by  first  estimating  f-1  (t*>)  [Durrani  &  Arslanlan(1982)]. 


2.  Memory  Type  of  a  Time  Series 

One  approach  to  the  comparative  behavior  of  spectral  estimation  methods  is  an 
empirical  one  which  compares  estimators  computed  for  various  examples  of  time 
series;  see  Beamish  and  Priestley  (1981)  and  Kay  and  Marple  (1981).  To  draw 
general  conclusions  from  such  studies  it  is  necessary  to  define  types  of  time 
series.  The  notion  of  memory  is  introduced  in  Parzen  (1982)  to  define  types. 

Memory  refers  to  how  far  in  the  future  (after  a  specified  time  t0)  one  can 
effectively  predict  the  time  series  given  observations  on  its  infinite  past  (up 
to  tD) .  The  infinite  memory  one  step  ahead  predictor  of  a  Gaussian  zero  mean 
time  series  is  denoted  Y^(t)  **  E[Y(t) |Y(t-l),  Y(t-2),...],  with  prediction  error, 
or  innovation,  Yv(t)  *  Y(t)  -  Y^(t).  The  innovations  have  a  basic  property  of 
being  white  noise,  and  are  also  denoted  e(t).  The  mean  square  prediction  error, 
measured  in  units  of  the  variance  R(0)  of  Y(t),  is  denoted  o2  *  E [ | Yv ( t ) | 2 ] 
i  E[  |Y(t)|2],,  and  satisfies 

(1)  log  o2«  /*  log  f(w)  dm 


when  Y(*)  is  a  stationary  Gaussian  zero  mean  time  series.  We  classify  such  time 
series  as  follows: 

(1)  a  no-memory,  or  white  noise,  time  series  if  it  satisfies  either  of  the 

equivalent  conditions:  p(v)  *  0  for  v>0;  f(u>)  “  1,  0<«<1;  a2  =  1. 

(2)  a  short  memory  time  series  if  its  correlation  function  p(v)  is  summable, 
and  its  spectral  density  function  f(u>)  is  bounded  above  and  below  in  the  sense 
that  the  dynamic  range  of  f(w) 

(2)  DR(f)  «  “*<;L  f(u.)  ^  f(u>) 


0<o)<l 


satisfies  l<DR(f)  <  Then  0<oj<l. 

(3)  a  long  memory  time  series  if  it  is  neither  no  memory  nor  short  memory, 
and  o£  »  0.  Alternatively  a  long  memory  time  series  is  one  which  is  non¬ 
stationary  or  non-ergodic.  It  usually  has  components  representing  cycles  or 
trends.  An  example  of  a  long  memory  time  series  is 
(3)  Y(t)  -  A  cos  27Tto0 1  +  B  sin  2nu)0t, 

where  A  and  B  are  independent  N(0,a2)  random  variables;  its  correlation  function 
p(v)  ”  cos  2iru)0v  is  not  summable,  and  a£  “  0.  A  band  limited  time  series  has  a 
correlation  function  that  decays  to  0  as  v  tends  to  •  at  the  rate  of  1/v;  it 
has  long  memory. 

An  important  example  of  a  long  memory  time  series  arises  when  Y(t)  ■  S(t) 
+N(t),  where  the  signal  S(t)  is  long  memory  and  the  noise  N(t)  is  short  memory  or 
no  memory.  Kay  and  Marple  (1981),  p.  1408  consider  a  signal  which  is  the  sum  of 
three  sinusoids  at  frequencies  .10,  .20,  and  .21  respectively,  and  signal  to 
noise  ratios  (SNR)  of  10,  30,  and  30  dB  respectively  where  SNR  is  defined  as  the 
ratio  of  the  sinusoid  power  to  the  total  power  in  the  noise  process. 

Our  approach  to  time  series  analysis,  and  spectral  analysis,  of  an  observed 
sample  Y(t),  t-1,  2,...,T  suggests  that  the  first  step  is  to  identify  the  memory 
type  (no,  short,  and  long)  of  the  time  series. 

A  no  memory  time  series  requires  no  further  analysis;  its  spectral  density 
is  a  constant. 

A  short  memory  time  series  is  analyzed  by  representing  it  (or,  more 
realistically,  approximating  it)  by  an  ARMA(p,q)  scheme: 
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Y (t)  +  ap(l)  Y(t-l)+...+ap(p)  Y(t-p) 

-  e(t)  +  Bq(D  e(t-l)+. .  .+8(j(q)  e(t-q) 


where  the  polynomials 

g  (z)  “  1+a  (l)z+.  ..+a  (p)  zP 

/n  P  P  ■  P  a 

'  '  hp(z)  -  l+eq(l)z+...+8q(q)  Zq 

are  chosen  so  that  all  their  roots  In  the  complex  z-plane  are  In  the  region 
{z:|z|>l)  outside  the  unit  circle.  Then  g  (z)  and  h  (z)  are  the  transfer 
functions  of  invertible  filters.  c(t)  is  Sssuraed  toqbe  a  white  noise  time  series 
which  we  identify  with  the  innovations  e(t)  “  Yv(t); 

(6)  02>q  =  E[e2 (t) ]  *  E[Y2(t)] 

is  an  estimator  of  o2.  The  spectral  density  of  an  ARMA(p,q)  scheme  is 


(7) 


f  (w) 

p.q 


|h<](e2’1“)|2 

’p,q  |g  f.2'1")!2 


The  process  of  identifying  ARMA  (p,q)  schemes  which  are  adequate  (and 
parsimonious)  approximating  models  for  a  time  series  can  be  studied  rigorously, 
and  various  at  least  semi-automatic  methods  are  available  which  are  based  on 
order  determining  schemes. 

A  long  memory  time  series  Y(-)  is  usually  analyzed  by  representing  it  as: 

(a)  the  sum  S(-)  +  N(*)  of  a  long  memory  signal  and  a  short  memory  noise,  or  (b) 
as  transformable  to  a  short  memory  time  series  by  a  non-invert ible  filter. 

Methods  for  finding  such  representations  of  long  memory  time  series  are  currently 
not  as  well  developed  as  methods  of  finding  ARMA  representations  of  short 
memory  time  series. 


3.  Autoregressive  Spectral  Densities 

Our  functional  inference  approach  to  statistical  problems  is  based  on  the 
premise  that  estimation  of  a  function  is  similar  to  the  problem  of  model 
identification.  A  typical  problem  of  statistical  model  identification  seeks  to 
find  the  smallest  number  of  significantly  non-zero  components  0.  of  a  finite 
dimensional  parameter  6  =  (0^, . . . ,0^) .  A  typical  problem  of  function  estimation 
seeks  to  find  the  smallest  finite  nus&er  of  parameters  which  provide  an  adequate 
("parsimonious")  approximation  of  a  function  when  the  function  can  be 
parametrized  exactly  by  a  countable  infinity  of  parameters. 

The  goals  of  functional  inference  and  model  identification  are  in  my  view 
best  pursued  simultaneously.  One  uses  finite  parameter  models,  but  one 
estimates  their  parameters  by  methods  that  can  be  interpreted  as  providing  either 
exact  models  or  approximating  models. 

Autoregressive  spectral  estimation  is  a  technique  for  spectral  analysis 
based  on  using  autoregressive  schemes  both  as  exact  models  and  as  approximating 
models  for  stationary  time  series.  An  exact  model,  with  an  infinite  number  of 
parameters,  for  a  short  memory  zero  mean  Gaussian  stationary  time  series  Y(t)  is 
provided  by  the  invertible  filters  which  relate  Y(t)  and  the  white  noise  series 
of  innovations  Yv(t).  The  AR(*»)  representation  is 

(1)  Y(t)  +  <x„(l)  Y(t-1)+...  -  Yv(t); 
the  MA(”)  representation  is 

(2)  Y(t)  -  Yv(t)  +  8.(1)  Yv(t-1)+... 

The  coefficients  8.(k)  are  determined  recursively  from  aw(j)  by  8.(0)  ■  1,  and 
for  k>0  k 

8.00  +  B.(k-j)  -  0 


The  AR(«)  and  MA(«>)  representations  have  important  implications  for  spectral 


(3) 
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analysis  since  they  provide  formulas  for  the  spectral  density  function  f(u) 
alternative  to  the  formula  that  f(u)  is  the  Fourier  Transform  of  p(v).  One  can 
show  that 

(4)  f(w)  -  of|ha>(e2,Tiu,)|2  , 

(5)  -  o;2u.(.2'1“)i2  . 

defining 

(6)  h  (z)  “  I  B  (J)  g  (z)  -  l  a  (k)  • 

j-0  k=0 

The  AR(»)  transfer  function  gm(z)  can  be  shown  to  be  the  limit  of  AR(m) 
transfer  functions 

(7)  g_(z)  “l+o  (1)  z+. ..+a  (m)  z™ 

~m  m  m 

whose  coefficients  a  (j)  are  the  coefficients  of  memory  m  prediction  errors 
defined  as  follows:  m 

(8)  Y»J*®(t)  “  E[Y(t)|Y(t-l),...,Y(t-m)] 

(9)  Yv»m(t)  “  Y(t)  -  YV.“  (t) 

“  Y(t)  +  om(l)  Y(t-l)+...+om(m)  Y(t-m), 

(10)  oj  “  E[|Yv.®(t)|2]  *  E[Y2(t)] 

=  E[Yv,m(t)  Y(t) ]  t  E[Y2 (t) ] 

Explicit  formulas  for  the  foregoing  predictors,  prediction  errors,  and  mean 
square  prediction  errors  can  be  obtained  from  the  correlation  function  p(v).  A 
predictor  is  characterized  by  the  condition  that  the  prediction  error  is 
orthogonal  (normal)  to  the  predictor  variables: 

(11)  E[Yv»m(t)Y(t-k) ]  -  0,  k“l,...,m 

By  substituting  (9)  into  (11)  one  obtains  the  famous  Yule-Walker  equations, 
defining  om(0)  =  1 


(12)  I  a  (j)  p(j-k) 

J-°  , 

One  obtains  by 


0,  k“l,...,m 


m 

(13)  -  l  c.  (j)  p(j). 

j-0 

For  a  short  memory  time  series,  these  equations  also  hold  with  msa>. 

The  notion  of  a  time  series  Y(*)  being  an  autoregressive  scheme  of  order  p, 
denoted  AR(p),  can  be  defined  in  terms  of  prediction  theory  as  follows:  Y(*)  is 
an  AR(p)  if  and  only  if  the  memory  p  prediction  errors  YV»P(*)  is  a  white  noise 
time  series  and  ap(p)  j*  0.  The  spectral  density  of  YV»P(*)  can  be  expressed  in 
terms  of  gp(z),  tne  autoregressive  transfer  function  of  order  p,  by 

(14)  f  (w)  -  -j  |g^(e2lfiw)  | 2  f(w) 

yv,p  ^  p 

P 

If  the  time  series  Y(*)  is  in  fact  AR(p),  then  its  spectral  density  equals  the 
function 


(15) 


l  .  2itiu)v  | 

l*p<*  >1 


-2 


fo(»)  -  o‘ 

P  P 

which  we  call,  in  general,  the  approximating  autoregressive  spectral  density  of 
order  p.  A  time  series  Y(*)  can  be  regarded  as  approximated  by  an  AR(p)  if 

<1H  V->  • 

can  be  regarded  as  not  "significantly"  different  from  a  constant.  In  this  way  a 
test  of  the  hypothesis  that  a  time  series  Y(*)  is  AR(p)  can  be  converted  to  a 
test  of  the  hypothesis  that  the  prediction  error  time  series  is  white  noise. 
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The  sequence  of  approximating  autoregressive  spectral  densities  fn(u), 
m-1 , 2 , .  • .  may  be  shown  to  converge  as  m  tends  to  •»  to  f(w)  at  each  w  In  Oku<l 
under  suitable  conditions  (see  especially  Neval  (1979)).  Sufficient  conditions 
are  that  f(w)  has  finite  dynamic  range  (and  therefore  is  bounded  above  and  below) 
and  has  a  continuous  derivative.  When  an  estimator,  denoted  fm(a>),  of  fm(to)  Is 
used  as  an  estimator  of  f(u>),‘  one  has  to  take  into  account  two  kinds  of  errors, 
called  respectively  bias  and  variance.  Bias  is  a  measure  of  the  deterministic 
difference  between  fm(ui)  and  f(u>),  while  variance  is  a  measure  of  the  stochastic 
distance  between  fm(u)  and  fm(u).  As  m  increases  bias  decreases  while  variance 
increases.  This  is  an  example  of  the  fundamental  problem  of  empirical  spectral 
analysis  which  is  how  to  achieve  an  optimal  balance  between  bias  and  variance. 
When  one  uses  autoregressive  spectral  estimation,  this  problem  reduces  to  a 
question  of  determining  the  order  m  of  the  approximating  autoregressive  scheme. 
The  convergence  of  autoregressive  spectral  estimators  is  studied  by  Berk  (1974), 
Carmichael  (1976),  and  Kromer  (1969). 

Autoregressive  spectral  densities  can  be  justified  philosophically  as  the 
solution  to  the  "maximum  entropy"  problem:  among  the  spectral  densities  f(o>) 
satisfying 


(17)  f1  e2lllvw  f(w)  du  -  P(v),  v=0,+l, . . .  ,+m 

°  —  ~  -1 
for  specified  correlation  p(v),  v=0,+l, . . . ,+m,  the  one  which  maximizes  Jologf(w) 

(18)  fm<*)  »  a2|gm(e2iri“)|-2 

where  gm(z)  is  given  by  (7)  in  terms  of  coefficients  ot^j)  computed  by  the  Yule- 
Walker  equations  (12).  This  fundamental  fact,  related  to  Burg  (1968),  was  proved 
by  VanDenBos  (1971).  A  new  proof  which  avoids  the  calculus  of  variations  is 
outlined  by  Parzen  (1981). 

The  maximum  entropy  principle  provides  a  motivation  or  justification  for  the 
use  of  autoregressive  spectral  estimators.  However  the  maximum  entropy  principle 
provides  no  insight  into  how  to  identify  an  optimal  order  m,  or  even  what  are  the 


effects  of  different  methods  of  estimating  the  parameters  am, 


It 


provides  no  guidance  for  how  to  learn  from  the  data  whether  the  time  series  is 
non-stationary  (long  memory)  or  stationary  (short  memory),  or  whether  the  best 
time  series  model  is  AR,  MA,  or  ARMA.  In  my  view  it  is  a  principle  for  deriving 
probability  models,  rather  than  statistically  fitting  models  to  data. 

It  should  be  realized  that  the  maximum  entropy  principle  justifies  auto¬ 
regressive  estimators  only  for  short  memory  time  series  (for  whom  log  f(w)  is 
integrable) .  Autoregressive  estimators  are  justified  for  l^ng  memory  time  series 
by  the  fact  that  a  pure  harmonic  Y(t)  =  A  cos  t  +  B  sin  pp-  t  satisfies  Y(t)- 

4>Y(t-l)+Y(t-2)=0  where  $  =  2  cos 


4.  An  Approach  to  Empirical  Spectral  Analysis 

The  current  status  of  the  search  for  the  perfect  spectral  estimator  seems  to 
be  as  follows:  spectral  estimation  is  not  a  non-parametric  procedure,  which  can 
be  conducted  independently  of  model  identification,  but  is  an  infinite-parametric 
procedure.  Kay  and  Marple  (1981),  in  their  excellent  survey  of  modem  spectral 
estimation,  seem  to  also  come  to  this  conclusion. 

The  memory  type  of  a  time  series  (long,  short,  or  no  memory)  and  the  type  of 
the  whitening  filter  of  a  short  memory  time  series  (AR,  MA,  or  ARMA)  must  be 
identified  to  arrive  at  the  final  form  of  spectral  estimator  in  an  applied 
problem.  Approximating  autoregressive  schemes  play  two  roles:  as  estimators, 
and  also  as  diagnostic  tools  in  the  process  of  model  identification. 

In  section  5  we  outline  the  application  to  model  Identification  of  the 
notions  of  information  divergence,  entropy,  and  cross-entropy  (which  for  Gaussian 
random  variables  are  expressed  in  terms  of  logarithms  of  mean  square  prediction 
errors).  The  approach  described  is  related  to  the  pioneering  work  of  Akaika(1977). 
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The  basic  aim  of  spectral  analysis  is  to  obtain  an  estimated  spectral  density 
with  the  "right"  degree  of  smoothness  in  the  sense  of  not  Introducing  spurious 
spectral  peaks,  and  also  resolving  nearby  spectral  peaks.  This  section  outlines 
our  approach  to  attaining  this  goal. 

A.  Pre-processing.  To  analyze  a  time  series  sample  Y(t),  t-1 . T,  one  will 

proceed  in  stages  which  often'  involve  the  substraction  of  or  elimination  of 
strong  effects  in  order  to  see  more  clearly  weaker  patterns  in  the  time  series 
structure. 

The  aim  of  pre-processing  is  to  transform  Y(*)  to  a  new  time  series  Y(-) 
which  is  short  memory.  Some  basic  pre-processing  operations  are  memory  less 
transformation  (such  as  square  root  and  logarithm),  detrending,  "high  pass" 
filtering,  and  differencing.  One  usually  subtracts  out  the  sample  mean 
1  T 

Y  *  —  £  Y(t);  then  the  time  series  actually  processed  is  Y(t)  -  Y.  If  the 

t=l 

mean  Y  is  a  large  number,  it  should  be  subtracted;  the  variations  in  Y(t)  are 
then  the  variations  of  Y(t)  about  its  mean.  The  sample  mean  Y  and  sample 
variance  ft(0)  should  always  be  recorded. 

B.  Sample  Fourier  Transform  by  Data  Windowing,  Extending  with  Zeroes,  and  Fast 
Fourier  Transform.  Let  Y(t)  denote  a  pre-processed  time  series.  The  first  step 
in  the  analysis  could  be  to  compute  successive  autoregressive  schemes  using 
domain  operations  only  in  the  time.  An  alternative  first  step  is  the  computation 
of  the  sample  Fourier  transform 

T 

(1)  <Ku>)  -  ][  Y (t)  exp  (~2*iwt) 

t-1  k 

at  an  equi-spaced  grid  of  frequencies  in  0<u><1,  of  the  form  w  -  Q, 
k-0,...,Q-l.  We  call  Q  the  spectral  computation  number.  One  should  always 
choose  Q  _>  T,  and  we  recommend  Q  j>  2T. 

Prior  to  computing  <K<d),  one  should  extend  the  length  of  the  time  series  by 
addition  zeroes  to  it.  Then  <ji(w),  <*>«  can  be  computed  using  the  Fast  Fourier 
transform. 

If  the  time  series  may  be  long  memory  one  should  compute  in  addition  a  sample 
"data  windowed"  Fourier  transform 
T 

(2)  ♦„((!))  -  l  Y ( t ) W (— )  exp  (-2jriwt) 

t=l 

To  understand  the  effect  of  the  window,  one  replaces  Y(t)  by  a  spectral 
representation  Y(t)  ■  exp  (2*iAt)  dY(A);  then 

iu(tu)  =  / 1  w  (a >-X)  dY(A)  where  w  (A)  -  J  W(^)  exp  (-27riAt) . 

-  °  1  1  t*l  1 

Considerations  involved  in  the  choice  of  data  windows  are  discussed  in  Harris 

(I978). 

C.  Sample  Spectral  Density.  The  sample  spectral  density  f(u»)  is  obtained 
essentially  by  squaring  and  normalizing  the  sample  Fourier  transform; 

(3)  f(w) - Ifwt  -  •-""$»  k  "  O'  1 . Q'1’ 

b  1  li&i2 


Over  the  domain  0£u<l,  it  integrates  to  1. 

D.  Sample  Correlation  Function.  The  sample  correlation  function 
T-v  T  „ 

p(v)  -  l  Y(t)  Y(t+v)  *  l  Y  (t)  is  computed  (using  the  Fast  Fourier 
t-1  t-1 

Transform)  by 
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.  i  Q-l  . 

p(y)  "  n  £  exp  (2ffiov>  f<H>  » 

^  k-0  W  Q 

which  holds  for  0  <_  v  <Q-T  “  M.  I  recommend  always  taking  H  T/3,  and  often  one 
should  choose  M  £  T(2/3). 

E.  Sample  Spectral  Distribution  Function. 

F((u)  *  2 f(u>')du)\  0<uj<_  0.5  ; 

the  graph  of  F(oj)  provides  qualitative  diagnostics  of  the  time  series  model  type. 

F.  Nonparametric  kernel  spectral  density  estimator.  An  estimator  £(u>)  of  the 
spectral  density  is  called:  parametric  when  it  corresponds  to  a  parametric  model 
for  the  time  series  (such  as  an  AR  or  ARMA  model);  non- parametric  otherwise.  A 
general  form  of  non-parametric  estimator  is  the  kernel  estimator 

f(w)  =  l  k(^)  p(v)  e_2lTl“vt  0<m<1. 


Two  popular  choices  of  kernel  are  the  Parzen  window  [Parzen  (1961)] 
k(t)  =  l-6t2  +  6t3  ,  }t|  <  0.5 

-  2  (1  -  |t|)3 
**  0 

and  the  spline-equivalent  window  [Parzen  (1958) ,  Cogburn  and  Davis  (1974), 
Wahba  (1980) ] 
k(t)  = 


0.5  <  |t  <  1 


1+t 


2r 


where  r  >  2  is  usually  chosen  to  equal  2  or  4.  The  problem  of  determining  opti¬ 
mum  truncations  points  M  has  no  general  solution;  one  approach  is  to  choose  a 
large  value  of  M  to  obtain  a  preliminary  smoothing  of  the  sample  spectral  density. 

G.  Autoregressive  analysis.  The  Yule-Walker  equations  are  solved  to  estimate 
innovation  variances  to  which  are  applied  order  determining  criteria  (AJC, 

CAT)  to  determine  optimal  orders  m  and  also  to  test  for  white  noise.  The  value 
of  6?  and  the  dynamic  range  of  the  autoregressive  spectral  estimator  £„,((!))  are 
used  to  determine  the  memory  type  of  the  time  series  [Parzen  (1982)]. 

H.  Non-stationary  autoregression.  When  a  time  series  is  classified  as  long 
memory,  more  accurate  estimators  of  autoregressive  coefficients  are  provided  by 
minimizing  a  "forward  and  backward"  least  squares  criterion 

l  (Y(t)  +  am(l)  Y(t-l)+...um(m)  Y(t-m))2 

t=ni+1  +  (Y(t-m)  +am(l)  Y(t-nri-l)  +...+0^(111)  Y(t))2, 
or  by  Burg  estimators  [for  references  to  descriptions  of  Burg's  algorithm,  see 
Kay  and  Marple  (1981)]. 

When  several  harmonics  are  present  in  the  data,  whose  frequencies  are  dose 
together,  least  squares  autoregressive  coefficient  estimators  are  more  effective 
than  Yule-Walker  autoregressive  coefficient  estimators  in  providing  auto¬ 
regressive  spectral  estimators  which  exhibit  the  split  peaks  one  would  like  to 
see  in  the  estimated  spectral  density. 

I.  ARMA  analysis.  When  a  time  series  is  classified  as  short  memory  the 
approximating  AR  scheme  is  used  to  form  the  MA(®)  coefficients  which  are  used  to 
form  a  subset  regression  procedure  for  determining  the  best  fitting  ARMA  scheme, 
and  the  corresponding  ARMA  spectral  density  estimator. 

J.  Inverse  correlations  and  cepstral  correlations.  Estimators  of  pi(v)  and 
y(v)  are  computed  and  used  to  form  nonparametric  kernel  estimators  of  f”l(w)  and 
log  f(u>),  which  may  provide  additional  insights  into  the  peaks  and  troughs  to  be 
given  significance  in  the  final  estimator  of  the  spectrum. 

K.  Long  Memory  analysis.  In  the  long  memory  case,  one  may  want  to  represent 
Y(t)  as  S(t)  +  N(t),  a  long  memory  signal  plus  short  memory  noise.  An  approach 
to  this  problem  may  be  provided  by  treating  the  sample  spectral  density  values 


f(k/Q)  as  a  data  batch  to  be  studied  by  non-parametrlc  data  modeling  methods 
using  quantile  functions  [see  Parzen  (1979] .  The  details  of  such  methods  are 
under  development. 

Examples  of  the  use  of  our  approach  to  empirical  spectral  analysis  will  be 
provided  in  the  oral  presentation.  Examples  of  autoregressive  spectral 
estimators  are  given  by  Beamish  and  Priestley  (1981)  and  Pagano  (1980). 

5.  Time  series  model  identification  by  estimating  information  and  order 
determining  criteria 

Consider  two  probability  densities  f(y)  and  g(y),  where  -®<y<®.  Define: 
cross-entropy  of  g  with  respect  to  f 

(1)  H(f;  g)  -  /".{-log  g(y))f(y)  dy; 
entropy  of  f 

(2)  H(f )  »  H(f ;  f)  -  /"•  {-log  f (y)}f (y)  dy; 

Information  divergence  of  (a  model)  g  from  (a  true  density)  f 

i (f ;  g)  *  H(f;  g)  -  H(f ) 

<3)  =  O-iog  f(y)  dy. 

Information  divergence  has  some  (but  not  all  of  the  properties  of  a  distance, 
since  I(f;  g)  £  0,  I(f;  f)  “  0. 

The  information  about  a  continuous  random  variable  Y  in  a  continuous  random 
variable  X  is  defined  by  any  one  of  the  following  equivalent  symbols: 

1(Y|X)  -  i(fY|X;  fy). 

(4)  "  V(fY|X=x5  fY>  f  (y) 

"  /I  ^  fxw  Oy  fY|x=x(y){-log  f7|“(9T} 

Define  entropy  of  Y  and  conditional  entropy  of  Y  given  X  by 

(5)  H(Y)  =  H(fy) 

(6)  H(YlX)  -  H(fy|x)  =  \  H(fy|x=x) 

One  may  show  that  Ex[H(fy|x=x;  *y ) 1  =  H(Y),  which  yields  the  first  fundmanetal 
relation 

(7)  I(Y|X)  -  H(Y)  -  H(Y|X) 

When  X  and  Y  are  jointly  normal,  H(fy|x_x)  does  not  depend  on  x,  and  is  the 

entropy  of  a  normal  distribution.  Therefore  for  jointly  normal  random  variables 
X  and  Y  . 

H(Y)  -  y  log  E(Y)  +  j  (1  +  log  2x) 

(8)  H(Y|X)  ■=  ~  log  E(Y|X)  +  y  (1  +  log  2ir) 

I(Y|X)  -  -  |  log  £'1(Y)  e(y|x) 

We  use  E(Y)  to  denote  the  variance  of  Y,  and  E(Y|X)  to  denote  the  conditional 
variance  of  Y  given  X. 

When  Y  and  X  are  multivariate  normal  vectors,  £  denotes  the  covariance 
matrix,  and  one  can  show  that 

,0.  I(Y|X)  -  (-  y)  log  det  Z_1(Y)  E(Y|X) 

»  (-  -j)  sum  log  eigenvalues  Z  (Y)  E(Y|X) 

The  second  fundamental  relation  decomposes  information: 

(10)  l(Ylxp9  -  I(Y|X1)  +  I(Y|X1;  X1,X2>, 


defining  the  information  about  Y  in  Xj  conditional  on  X^ 

(11)  "  H(fY]X1)  *  H(fY|xi,  X^ 

-  H(Y|X1)-  H(Y|X!,  X2) 

The  fundamental  approach  of  information  number  model  identification  pro¬ 
cedures  uses  estimators  ItYjX^;  X^,  X2)  of  information  numbers  I(Y| X^X-^.Xj) 
as  test  statistics  for  discriminating  between  alternative  hypotheses  H,,:  Y 
depends  on  X^;  Y  depends  on  and  X^  In  other  words,  the  null  hypothesis 
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Hq:  X}  Is  "sufficient"  to  predict  Y  given  the  -information  base  X.  and  X.,  is 
equivalent  to  H,,:  I(y|Xx;  X^,X2)  -  0.  1  1 

Models  for  a  zero  mean  stationary  Gaussian  univariate  time  series  Y(t),  t-0, 

+ 1 , . . . ,  can  be  formulated  in  terms  of  information  numbers.  For  p«l,2,..., 
define 

(12)  Ip  -  I(Y|Y_1,...,Y_p)  -  I(Y(t)|Y(t-l) . Y(t-p)). 

In  words.  Ip  is  the  Information  about  Y(t)  in  the  p  most  recent  values.  Let  Y 
denote  the  Infinite  past  Y(t-l),  Y(t-2),...  .  Then  I,,  ■I(y|y'*)  “  Ip*  The 
hypothesis  that  the  time  series  Y(*)  is  an  AR(p),  autoregressive  scheme  of  order 
p,  is  equivalent  to 

(13)  i(y|y_1» . . *Y_p;  y”)  -  i.  -ip  -  o 

To  estimate  Ip,  one  estimates  the  coefficients  of  an  AR(p) .  Further 

(14)  Ip  -  -  \  log  o2  t  Vs  "  \  lo«  °£=  -  J  fo  lo8  f(“>>  dt*>- 

Given  a  time  series  sample,  Y(t),  t=l,2,...T,  of  length  T,  one  can  calculate 
successively  (using  Fast  Algorithms  such  as  the  Yule-Walker  equations) 
estimators  . 

(15)  ip  =  *  2  log  op,  P  “  1,  2,  ...  • 

which  can  be  regarded  as  test  statistics  for  testing  white  noise,  or  more  pre¬ 
cisely  AR(0)  against  AR(p).  The  work  of  Akaike  (1974,  1977)  and  Hannan  and 
Quinn  (1949)  leads  one  to  conjecture  that  a  universal  test  for  white  noise 
(whose  theory  needs  further  study)  is  of  the  form  (for  a  suitable  choice  of 
constant  c>0,  say  c=l) 

(16)  -  y  log  o2  -  ^  log  log  T  £  ^  for  P=1»  2»  •  •  • 

A  related  conjecture  is  that  optimal  orders  p  of  approximating  autoregressive 
schemes  can  be  identified  by  first  determining  the  orders  at  which  are  attained 
the  absolute  and  relative  minima  of  order  determining  criteria  which  determine 
orders  p  for  which  -  Ip  is  not  significantly  different  from  zero.  It  should 
be  emphasized  that  the  ultimate  decision  on  the  adequacy  of  a  model  should  be 
based  on  a  definition  of  "parsimony  of  a  model"  which  requires  that  the  spectral 
distribution  function  of  the  residuals  (Y|  variables  in  model) v  (t)  be 
"parsimoniously"  not  signficantly  different  from  white  noise. 

Akaike' s  order  determining  criterion  AIC  is  defined  by 

(17)  AIC  (m)  =  log  52  +  ; 

m  i 

it  seeks  to  determine  the  order  of  an  exact  autoregressive  scheme  which  the  time 
series  is  assumed  to  obey.  One  can  raise  the  objection  against  it  that  it  does 
not  consistently  estimate  the  order,  which  is  done  by  a  criterion  due  to  Hannan 
and  Quinn  (1979) t 

(18)  AICHQ(m)  -  log  o2  +  f  log  log  T 

m  X 

Parzen  (1974),  (1977)  introduced  an  approximating  autoregressive  order 
criterion  called  CAT  (criterion  autoregressive  transfer  function),  defined  by 

(19)  CAT(m)  »  i  f  (1-  b  0~2  -  (1-  =)  S;2  . 

1  j“l  J 

In  practice  CAT  and  AIC  lead  in  many  examples  to  exactly  the  same  orders.  It 
appears  reassuring  that  quite  different  conceptual  foundations  can  lead  to 
similar  conclusions  in  practice. 

An  important  application  of  fitting  an  approximating  autoregressive  scheme 
AR(p)  to  a  time  series  is  the  estimation  of  Information  numbers  which  are  used 
to  determine  the  goodness  of  fit  of  ARMA  (p,q)  schemes.  An  approach  to  ARMA 
modeling  can  be  based  on  estimating  I(Y|Y_lt . . . ,Y_p,  Yv_1, . . . ,Yv_qj  Y  ). 
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